Low-complexity methods for assessing distances between pairs of documents

ABSTRACT

Two sets X₂ and X₁ of histograms of words, and a vocabulary V, are accessed. Each of the two sets is representable as a sparse matrix, each row of which corresponds to a histogram. Each histogram is representable as a sparse vector, whose dimension is determined by a dimension of the vocabulary. Two phases compute distances between pairs of histograms. The first phase includes computations performed for each histogram and for each word in the vocabulary to obtain a dense, floating-point vector y. The second phase includes computing, for each histogram, a sparse-matrix, dense-vector multiplication between a matrix representation of the set X₁ of histograms and the vector y. The multiplication is performed to obtain distances between all histograms of the set X₁ and each histogram X₂[j]. Distances between all pairs of histograms are obtained, based on which distances between documents can subsequently be assessed.

BACKGROUND

The invention relates in general to the field of computer-implemented methods for assessing distances between pairs of documents, such as methods known as Word Mover's Distance and Relaxed Word Mover's Distance.

Modern computational systems are known, which are capable of ingesting, storing and searching across a prodigious amount of textual information. Efficient search through textual information should ideally allow both high-quality results and speed of execution. Today, high-quality search results can be achieved thanks to a new class of text representations based on neural networks, such as the popular word2vec representation [1]. The word2vec algorithm capitalizes on neural networks to map words onto an appropriate vector space. In this embedding space, words that are semantically synonymous will be close to each other. To capture the similarity between sentences, Kusner et al. proposed the Word Mover's Distance (WMD) [2], which represents a promising method for evaluating the similarity between text documents, in terms of the quality of search results obtained. WMD is an adaptation of the Earth Mover's Distance (EMD) [3], initially proposed to measure similarity between images. Like EMD, WMD constructs a histogram representation of the documents and estimates the cost of transforming one histogram into the other. However, the computational complexity of both approaches scales cubically with the size of the histograms, making their application prohibitive on Big Data.

To mitigate the large complexity of WMD, Kusner et al. proposed using a faster, lower-bound approximation to WMD, called the Relaxed Word Mover's Distance, or RWMD [2]. They showed that the accuracy achieved by RWMD is very close to that of WMD. According to the original paper, the time complexity of the RWMD method grows quadratically with the size of the histograms, which represents a significant improvement with respect to the WMD method. However, quadratic complexity still limits the applicability of RWMD to relatively small datasets. Such a complexity still prevents computing pairwise similarities across millions of documents, which would have extensive applications in clustering, classification, and querying of documents.

SUMMARY

According to a first aspect, the present invention is embodied as a computer-implemented method for assessing distances between pairs of documents. This method first comprises accessing: two sets X₂ and X₁ of histograms of words, where each of the histograms pertains to a respective document; and a vocabulary V of words. Each of the two sets Xₖ, k=1, 2, is representable as a sparse matrix, each row of which corresponds to a histogram Xₖ[l] ∈ Xₖ, l ∈ {1, . . . , Nₖ}. In addition, each histogram Xₖ[l] is representable as a sparse vector, whose dimension is determined by a dimension of the vocabulary V.

The present methods essentially comprise two phases, which aim at computing distances between pairs of histograms (X₁[i], X₂[j]), ∀ i ∈ {1, . . . , N₁} ∧ ∀ j ∈ {1, . . . , N₂}. The first phase includes computations performed for each histogram X₂[j] ∈ X₂ and for each word in the vocabulary V. There, a distance between said each word in V and a closest entry in said each histogram X₂[j] is computed, so as to obtain a dense, floating-point vector y, whose dimension is determined by the dimension of the vocabulary V.

The second phase includes computing, again for said each histogram X₂[j] ∈ X₂, a sparse-matrix, dense-vector multiplication between a matrix representation of the set X₁ of histograms and the vector y (as computed during the first phase for said each histogram X₂[j] ∈ X₂). The multiplication is performed as a matrix-vector product, in order to obtain distances between all histograms X₁[i], i ∈ {1, . . . , N₁} of the set X₁ and said each histogram X₂[j]. Performing the two-phase process for each histogram X₂[j] allows distances between all pairs of histograms to be obtained, based on which distances between documents can subsequently be assessed.

This approach makes it possible to markedly reduce the time complexity and storage requirements, compared to methods such as WMD and RWMD, while essentially retaining the quality of search results obtained. This, in practice, makes high-quality search results, such as those provided by the WMD and RWMD methods, applicable to massive datasets.

In embodiments, the vector y is computed based on: a dense matrix W that stores vector representations of words in the vocabulary V, and a dense matrix T₂[j] that stores vector representations of words in X₂[j]. For example, the vector y is preferably computed as min(W ∘ T₂[j]^(T)), where ∘ is a modified matrix multiplication operation, wherein distances are computed between word vectors of W and T₂[j]^(T). Row-wise minimums of W ∘ T₂[j]^(T) are then computed to obtain min(W ∘ T₂[j]^(T)). This way, minimal distances between the words in the vocabulary V and the words in X₂[j] can be obtained, which are stored in the vector y, in view of the subsequent sparse-matrix, dense-vector multiplication.

The above embodiments make it possible to compute the costs of moving histograms X₁[i], i ∈ {1 . . . N₁} to histograms X₂[j], j ∈ {1 . . . N₂}. Now, a tighter lower bound can be achieved by further computing the costs of moving histograms X₂[j], j ∈ {1 . . . N₂} to X₁[i], i ∈ {1 . . . N₁}. That is, the above two-phase process can be repeated, wherein X₁ and X₂ are swapped and the new round of computations is now run for each X₁[i].

According to another aspect, the invention is embodied as a computerized method for managing sets of documents. This additional method assumes that, given a set X₁ of histograms of words, where each of the histograms pertains to a respective document, the set X₁ corresponds to a first set of documents as currently stored in a pool of documents. Now, a new set (i.e., a second set) of documents may be received. There, a corresponding set X₂ of histograms of words is created, wherein each histogram pertains to a respective document of the new set of documents received. Finally, distances between pairs of documents of the two sets can be assessed, according to embodiments of the method described above.

The new set of documents is preferably stored in the pool of documents after having assessed distances between pairs of documents of the two sets.

According to a final aspect, the invention can be embodied as a computer program product for assessing distances between pairs of documents. The computer program product comprises a computer readable storage medium having program instructions embodied therewith. The program instructions are executable by one or more processors, to cause the latter to implement steps according to the present methods.

Computerized methods and computer program products embodying the present invention will now be described, by way of non-limiting examples, and in reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1-3 are flowcharts illustrating high-level steps of methods for assessing distances between pairs of documents, according to embodiments.

FIGS. 4A-4B (collectively FIG. 4) and FIGS. 5A and 5B (collectively FIG. 5) schematically depict components (documents, histograms, sets of histograms, vocabulary) and data structures (matrices, vectors) as involved in embodiments of the present methods;

FIG. 6 is another flowchart, which illustrates high-level steps of methods for managing sets of documents, according to embodiments; and

FIG. 7 schematically represents a computerized system, suitable for parallelizing steps performed in the present methods, as involved in embodiments of the invention.

The accompanying drawings show simplified representations of methods, devices or parts thereof, as involved in embodiments. Similar or functionally similar elements in the figures have been allocated the same numeral references, unless otherwise indicated.

DETAILED DESCRIPTION

The aim of the present invention is to propose a low-complexity method for accurately assessing distances between pairs of documents, which reduces the average time complexity when operating on large sets of documents. In specific embodiments (see sect. 2), a linear-complexity implementation of the RWMD method is achieved (henceforth referred to as LC-RWMD), which is functionally equivalent to the quadratic-complexity RWMD but results in orders of magnitude reduction of the runtime.

The following description is structured as follows. First, general embodiments and high-level variants are described (sect. 1). The next sections address more specific embodiments (sect. 2) and technical implementation details (sect. 3).

1. General Embodiments and High-Level Variants

In reference to FIGS. 1-5, a first aspect of the invention is described, which concerns a computer-implemented method for assessing distances between pairs of documents.

Basically, this method relies on two sets X₂ and X₁ of histograms 2 of words and a vocabulary V of words, which are accessed at steps S1-S9, see the flowcharts of FIGS. 1-3. Each histogram 2 of each set X₁ or X₂ pertains to a respective document 1, as schematically depicted in FIG. 4A. That is, each document 1 is represented as a histogram 2 of words, as commonly used in the context of WMD and RWMD methods. As known per se, a histogram can be derived from a text document, so as to store the weights (e.g., term frequencies) of words (e.g., unique words) in that document.

Histograms 2 are then grouped into sets Xₖ, k=1, 2, as illustrated in FIG. 4A. In practice, the second set X₂ may for instance correspond to a newly received set of documents that need be compared to an existing (i.e., already stored) set X₁ of documents, as in another aspect of the invention, later described in reference to FIG. 6.

Each of the two sets Xₖ, k=1, 2, is assumed to be representable as a sparse matrix, each row of which corresponds to a histogram Xₖ[l] ∈ Xₖ, l ∈ {1, . . . , Nₖ}. In addition, each histogram Xₖ[l] is representable as a sparse vector, whose dimension is determined by the dimension (hereafter noted |V|) of the vocabulary V, as in the context of WMD and RWMD methods. See FIGS. 4A, 4B for an illustration.

The present methods essentially comprise two phases S10, S20, which aim at computing distances between pairs of histograms (X₁[i], X₂[j]), ∀ i ∈ {1, . . . , N₁} ∧ ∀ j ∈ {1, . . . , N₂}. Each of the two phases S10, S20 is designed so that it need be performed for each histogram X₂[j] ∈ X₂.

The first phase S10 includes computations performed for each histogram X₂[j] ∈ X₂ and, in addition, for each word w in the vocabulary V. In detail, a distance between said each word in V and a closest entry (i.e., in terms of distance) in said each histogram X₂[j] is computed S10. A dense, floating-point vector y is accordingly obtained S18. That is, results of operations performed at S10 are stored in y. Because such operations are performed for each word of V, the dimension of vector y is determined by the dimension |V| of the vocabulary V. That is, the number of entries in y corresponds to |V|.

The second phase comprises computing S20 (again for said each histogram X₂[j] ∈ X₂) a multiplication S24 between a matrix representation of the set X₁ of histograms and the vector y. This multiplication is thus a sparse-matrix, dense-vector multiplication. It is carried out S24 as a matrix-vector product. That is, multiple dot products are performed, i.e., one dot product between each X₁[i] and y. The vector y as used during the second phase is the vector as computed during the first phase for that same histogram X₂[j]. Because the multiplication (matrix-vector product) performed at step S24 involves a matrix representation of X₁, it allows distances between all histograms X₁[i], i ∈ {1, . . . , N₁} of the set X₁ and said each histogram X₂[j] to be obtained (S26), in one operation. Note that one may rely on CPU/GPU implementations of the so-called BLAS library to perform the matrix multiplication operations, so as to compute dot-products in parallel on the available compute resources (using several processing cores and/or vector units as available on a CPU/GPU).
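For illustration, the following is a minimal NumPy/SciPy sketch of the two-phase process just described, assuming L1-normalized csr histogram matrices X1 and X2 and a dense embedding matrix W; the function name and signature are illustrative only and do not reflect the reference implementation (which targets GPUs, see sect. 3).

```python
import numpy as np
from scipy.spatial.distance import cdist

def lc_rwmd_column(X1, X2, W, j):
    """One column D1[:, j]: distances from every histogram X1[i]
    to histogram X2[j] (sketch of phases S10 and S20)."""
    # Phase S10-S18: for each vocabulary word, the distance to the
    # closest word of X2[j]; y is a dense vector of dimension |V|.
    T2j = W[X2[j].indices]            # d2 x M word vectors of X2[j]
    y = cdist(W, T2j).min(axis=1)     # row-wise minimums of W ∘ T2[j]^T
    # Phase S20-S26: sparse-matrix, dense-vector product X1 @ y; each
    # dot product gathers the minimum distances at the word positions
    # of X1[i], weighted by the word frequencies stored in X1[i].
    return X1 @ y                     # N1 distances, i.e., D1[:, j]
```

Note that cdist(W, T2j) materializes a |V| × d₂ matrix, but this cost is paid once per X₂[j] and the resulting vector y is reused across all N₁ rows of X₁; this is precisely the amortization discussed below.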

Repeating the two-phase process S10, S20 for each histogram X₂[j], distances between all pairs (X₁[i], X₂[j]) of histograms can eventually be obtained, ∀ i ∈ {1, . . . , N₁} ∧ ∀ j ∈ {1, . . . , N₂}. Finally, distances between pairs of documents 1 can subsequently be assessed S30, S40, based on the pairwise distances obtained, e.g., for ingesting, storing and searching across an amount of textual information.

A main advantage of the above scheme is that the relatively high cost of the first phase is amortized when considering a large set (e.g., millions) of histograms X₁[l] ∈ X₁, owing to the simple multiplication (matrix-vector product) involved in the second phase. As discussed later in detail, each of the two phases S10, S20 can in fact be implemented so as to have low (e.g., linear) complexity in terms of sizes of the sets Xₖ and sizes of the histograms Xₖ[l].

As a result, the present approach eventually allows the complexity to be reduced (with respect to implementations of the WMD and RWMD methods), while essentially preserving the accuracy of the prior WMD and RWMD methods. Embodiments even allow a linear-complexity implementation (in terms of average sizes of histograms) to be achieved, which can thus be regarded as a linear-complexity implementation of the RWMD method.

More generally, the new methods disclosed herein allow the average time complexity to be reduced, which, in practice, renders the high-quality search results offered by WMD and RWMD applicable to massive datasets. In that respect, another aspect of the invention concerns a method of managing sets of documents, which is illustrated in the flowchart of FIG. 6. This method assumes a scenario in which, given a set X₁ of histograms of words, where X₁ corresponds to a first set of documents as currently stored S50 in a pool of documents, a new set (i.e., a second set) of documents is received S51. A corresponding set X₂ of histograms of words is created S52, S53 for the second set of documents. Here again, each histogram of X₂ pertains to a respective document of the new set of documents received S51. Then, distances between pairs of documents of X₁ and X₂ are assessed, according to the method discussed above. One may for instance rely on steps S4-S40 of FIG. 1. Yet, the methods illustrated in FIGS. 2 and 3 may also be used, as described later. In practice, the new set of documents is typically stored S54 (e.g., integrated) in the existing pool of documents after having assessed S40 distances between pairs of documents of the two sets. More efficient management (e.g., in terms of storage, integration, queries, etc.) will, in general, indeed be achieved if the new set is stored after the analysis S4-S40.

Moreover, embodiments of the present methods are very suitable for parallelization, owing to the fact that the two-phase process S10, S20 is performed for each histogram X₂[j]. In particular, the present methods are generally found to suitably map onto graphics processing units (GPUs) and can thus be efficiently distributed across a cluster of GPUs. In that respect, experiments conducted by the Inventor on real-life datasets have notably demonstrated: (i) up to two orders of magnitude performance improvement with respect to a multi-GPU implementation of the quadratic RWMD, as designed by the Inventor; and (ii) up to five-to-seven orders of magnitude performance improvement with respect to a single-threaded CPU implementation of WMD that uses RWMD for pruning.

Interestingly, the present methods not only reduce the computational complexity, but also make it possible to reduce storage requirements with respect to a quadratic-complexity implementation of the known RWMD method. In fact, because embodiments disclosed herein happen to map well onto linear algebra primitives as supported by modern GPU programming infrastructures, they effectively require a limited amount of working memory. Thus, they are very suitable for hardware acceleration. Reducing the time complexity of RWMD (e.g., from quadratic to linear) further makes it possible to compute pairwise distances across: (i) large sets of documents (e.g., millions of documents); and (ii) large histograms (e.g., millions of entries in histograms).

For instance, the present methods can very advantageously be applied where one or each of the two sets X₂ and X₁ exceeds ˜1 000 000 histograms. Note, however, that embodiments of the present methods already show net advantages, e.g., in terms of time complexity, for lower numbers of histograms. For example, it can be shown that embodiments (such as discussed in sect. 2) are slightly faster than the quadratic RWMD method even if the two sets each comprise only ˜10 000 histograms. Incidentally, it is noted that the vocabulary size also plays a role; i.e., the size of V impacts, in practice, the number of histograms needed.

In addition, the present methods make it possible to use histograms having very large numbers of words. Namely, tractable computations can here be contemplated, even if one or more of the histograms of any or each of the sets X₂ and X₁ comprises ˜1 000 000 distinct words, or more. Even more so, tractable computations can still be achieved, within the present approach, if each of X₂ and X₁ comprises millions of histograms, while one or more of these histograms comprise millions of distinct words.

However, embodiments such as illustrated in FIGS. 1, 2 and discussed in sect. 2 may happen to be slightly slower when smaller numbers of histograms are involved. Still, variants to such embodiments will later be discussed, in reference to FIG. 3, to address this issue.

Preferred embodiments are now discussed in detail. To start with, Euclidean distances are preferably relied on, consistently with former WMD and RWMD approaches. In particular, the distance computed at step S10 (between each word in V and the closest entry in a current histogram X₂[j]) is preferably a Euclidean distance. As a result, Euclidean distances are implied during the second phase too, where the sparse-matrix, dense-vector multiplication is computed. Still, other metrics can be contemplated. However, distance metrics are generally preferred over similarity metrics, because methods such as EMD and WMD normally require a distance metric rather than a similarity metric. E.g., a cosine similarity is typically not suitable in the present case. Yet, the complexities (computational costs) of the cosine similarity and the Euclidean distance are essentially the same (both are determined by a matrix multiplication operation). In fact, if the word vectors are L2-normalized, the cosine similarity can be translated into a Euclidean distance using a simple formula.
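The conversion in question is a standard identity, stated here for completeness for L2-normalized word vectors u and v:

$$\|u - v\|_2 = \left(\|u\|_2^2 + \|v\|_2^2 - 2\,u \cdot v\right)^{1/2} = \left(2 - 2\cos(u, v)\right)^{1/2}, \qquad \text{when } \|u\|_2 = \|v\|_2 = 1.$$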

As illustrated in FIGS. 1-3, during the first phase S10, the vector y is preferably computed S18, based S9, S14 on two dense matrices. The first one, W, stores vector representations of words in the vocabulary V. The second one, T₂[j], stores vector representations of words of the current histogram X₂[j], as depicted in FIGS. 4A, 4B. Using dense matrices W and T₂[j] as defined above makes it possible to more easily compute distances between words in V and the closest entry in X₂[j], as exemplified below.

Different approaches can be contemplated for the operation performed at step S18. What perhaps is the most efficient is to perform S18 this operation as a modified matrix multiplication operation. Namely, the vector y may be computed S18 as min(W ∘ T₂[j]^(T)), where ∘ is a modified matrix multiplication operation, as depicted in FIG. 5A. In this specific operation, distances are computed between word vectors of W and T₂[j]^(T), instead of computing mere dot products between W and T₂[j]^(T), as one would expect in a classic matrix multiplication operation. In addition, here, row-wise minimums of W ∘ T₂[j]^(T) are computed to obtain min(W ∘ T₂[j]^(T)). This makes it possible to derive minimal distances between words in the vocabulary V and words in X₂[j], which provides a practical scheme to identify a distance between each word in V and the closest entry in X₂[j]. Eventually, such minimal distances are stored S18 in the vector y, for the current iteration on X₂[j], it being reminded that such iterations can be parallelized, as noted earlier.

The modified matrix multiplication operation described above makes it possible to markedly reduce the complexity of the first phase. I.e., the complexity of this phase is determined by the ∘ operation and is given by O(|V| d₂ M), as re-discussed later in detail. Although the ∘ operation does not by itself reduce the complexity, the result of this operation can be directly re-used for all X₁[i] in the second phase, which, as a whole, lowers the complexity of the process.

As further illustrated in FIGS. 1-3, the dense matrix W may store vectors of floating point numbers, as obtained using any suitable word embedding process S5 (e.g., Word2Vec), so that words from the vocabulary can be mapped onto vectors of real numbers. In detail, for each word w ∈ {1, . . . , |V|} in the vocabulary V, the corresponding row W[w] of the matrix W may store a vector of M floating point numbers, as obtained via the word embedding process. This way, each word w can be mapped onto a vector of real numbers.

The modified matrix multiplication operation ∘ performed at step S18 may, in practice, be performed as follows. First, a norm of each row of W and each row of T₂[j] is computed. Then, W can be multiplied by the transpose of T₂[j], in order to compute distances between each row of W and each row of T₂[j]. Again, Euclidean distances are typically meant. I.e., the Euclidean distance between word vectors W[u] and W[v] is given by ∥W[u]−W[v]∥₂ and can be calculated as follows, where ∥·∥₂ is the Euclidean norm and ^(T) denotes a transpose:

$$\|W[u] - W[v]\|_2 = \left(\|W[u]\|_2^2 + \|W[v]\|_2^2 - 2\, W[u]\, W[v]^T\right)^{1/2}.$$

This operation and its implications are described in more detail insect. 2.

Various optimizations can be contemplated. First, the vocabulary V may be processed (see step S4, FIGS. 1-3), prior to starting the first phase S10, so as for it to comprise a set of unique words as used in the union of the two sets X₂ and X₁ of histograms. This way, the number of operations required during the first phase S10 can be markedly reduced, it being noted that the average number of unique words per document (excluding stop-words) determines the average histogram size (noted d).

As one may realize, one may further reduce the size of the vocabulary V by removing S6 words from the vocabulary V that do not appear in the set X₁ of histograms, as such words are indeed not needed to determine the distances between words in V and the closest entry in histogram X₂[j]. This further lowers the number of operations needed at step S10.
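The following is a sketch of this vocabulary reduction step S6 on csr data, in the same NumPy/SciPy setting as the earlier sketch; prune_vocabulary is an illustrative name, not part of any reference implementation.

```python
import numpy as np

def prune_vocabulary(X1, W):
    """Step S6 (sketch): keep only the vocabulary words occurring in X1.

    Returns the reduced embedding matrix and an old-to-new index map
    (-1 for dropped words), so histogram columns can be re-indexed.
    """
    used = np.unique(X1.indices)                  # columns with nonzeros
    old_to_new = np.full(X1.shape[1], -1, dtype=np.int64)
    old_to_new[used] = np.arange(used.size)
    return W[used], old_to_new
```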

As made explicit in FIGS. 1-3, each of the first and second phases S10, S20 involves operations performed for each histogram X₂[j], such that both phases S10, S20 are equally suitable for parallelization. More precisely, the distance computation of step S18 and the sparse-matrix, dense-vector multiplication of step S24 can altogether be performed independently for each histogram X₂[j] ∈ X₂. Accordingly, while operations involved in phases S10, S20 are preferably executed consecutively, the two-phase operations S10, S20 can be computed in parallel, across all histograms X₂[j] ∈ X₂. In addition, because the present approach further allows a direct comparison against all X₁[i] ∈ X₁, the complexity of the computation can further be reduced, even with respect to a quadratic implementation of the RWMD method, see sect. 2.

This, as illustrated in sect. 2, eventually yields much faster computations. In particular, and as noted earlier, the two-phase process S10, S20 can be implemented using linear algebra primitives of general-purpose computing on GPUs. Their execution can thus advantageously be distributed across a cluster of GPUs. All this is explained in detail and illustrated by concrete results, in sect. 2.

By the end of the second phase S20, all distances relating to each current X₂[j] are stored in a data structure, which is noted D₁[:, j] as it may be regarded as the j^(th) column of a corresponding matrix D₁. Note, however, that the matrix D₁ is preferably computed column-by-column (i.e., one column at a time, or each column in parallel), and the notation “D₁[:, j]” as used herein does not imply that the matrix D₁ be computed as a whole. Yet, the final results, as obtained after completing the loop (or parallel computations) on each X₂[j], may possibly be stored as an N₁×N₂ matrix D₁, step S30, for subsequent use, e.g., for managing document sets, S40.

The complexity of the procedure of FIG. 1 is now described in detail. To start with, it can be verified that the time complexity of the method of steps S12-S26 as shown in FIG. 1 is O(N₂|V|d₂M + N₁N₂d₁). Here, N₁ and N₂ respectively correspond to the numbers of histograms of the two sets X₁ and X₂, while d₁ and d₂ are the respective, average dimensions of histograms of the two sets. As noted earlier, |V| is the dimension of the vocabulary V and M is the number of floating point numbers stored in each row W[w] of the dense matrix W, the latter obtained by a word embedding process. In comparison, the time complexity of RWMD is O(N₁N₂d₁d₂M), meaning that the improvement is on the order of O(min(N₁d₁/|V|, d₂M)).

As mentioned earlier, the storage space requirements too can benefit from the present methods. This is in particular true for the method of steps S12-S26 shown in FIG. 1, whose storage space complexity is O(N₁d₁ + |V|M). In comparison, the complexity of a quadratic implementation of the RWMD method is O(N₁d₁M), meaning that an improvement of O(min(N₁d₁/|V|, M)) can be achieved.

As one may realize, the ratio N₁d₁/|V| may be interpreted as indicating the extent to which an improvement can be achieved with respect to the classic RWMD. Thus, the methods described so far will advantageously be implemented where such a ratio is larger than (or equal to) 1. In practice, and as mentioned earlier, this typically is the case when relatively large numbers of histograms are involved (e.g., more than ˜10 000, on average, per set Xₖ).

The algorithm described so far merely relates to computing the costs of moving histograms X₁[i], i ∈ {1 . . . N₁} to histograms X₂[j], j ∈ {1 . . . N₂}, as illustrated in the flowchart of FIG. 1. Yet, as one may realize, a tighter lower bound could be achieved by further computing the costs of moving histograms X₂[j], j ∈ {1 . . . N₂} to X₁[i], i ∈ {1 . . . N₁}, as explained now in reference to FIG. 2. In this variant, the steps performed along the left-hand side branch of FIG. 2 are identical to those of the right-hand side branch, which are themselves identical to those of FIG. 1. The only differences are that X₁ and X₂ are swapped and the new round of computations is now run for each X₁[i]. This additional process gives rise to corresponding steps S9a-S26a (LHS branch of FIG. 2), which are otherwise identical to steps S9-S26 (RHS branch of FIG. 2, FIG. 1). Since the LHS algorithm of FIG. 2 is run for each X₁[i], the operations performed in the two-phase process S10a, S20a may once more be run in parallel. Practically, the additional process S10a, S20a may typically be performed after having performed the two-phase process S10, S20, especially if massive sets of documents are involved.

This eventually yields an N₂×N₁ matrix D₂, step S30a. A final distance matrix D can be obtained, step S33, by computing the maximum values of the symmetric entries in D₁ and D₂ (i.e., D = max(D₁, D₂^(T))). This way, a tight lower bound to WMD can be obtained, it being reminded that RWMD is a lower bound of WMD. The maximum of D₁ and D₂^(T) thus produces an even tighter lower bound, as sketched below.
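In the NumPy setting of the earlier sketches (and reusing the illustrative lc_rwmd_column function), steps S30a-S33 amount to:

```python
import numpy as np

# Columns of D1 (N1 x N2), then of D2 (N2 x N1) with X1 and X2 swapped;
# step S33 takes the element-wise maximum of the symmetric entries.
D1 = np.column_stack([lc_rwmd_column(X1, X2, W, j) for j in range(X2.shape[0])])
D2 = np.column_stack([lc_rwmd_column(X2, X1, W, i) for i in range(X1.shape[0])])
D = np.maximum(D1, D2.T)   # tighter lower bound to WMD
```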

Note that each of the branches S9-S26 and S9a-S26a may benefit from similar optimizations as already described earlier. First, the vocabulary taken into account for the distance computation steps S10, S20, S10a, S20a may be obtained S4 from an initial vocabulary V₀ that comprises a set of unique words used in the union of the sets X₂ and X₁. Second, after the word embedding process S5, the vocabularies used in each branch shall preferably differ, so as to be optimized for each concurrent process. Namely, while the vocabulary V₁ taken into account for the RHS branch may be reduced S6 by removing words that do not appear in X₁, the vocabulary V₂ used in the LHS branch may conversely be reduced S6a by removing words that do not appear in X₂, for the same reasons as mentioned previously.

Again, each of the two-phase processes (i.e., of each branch) is preferably executed in parallel, i.e., for each histogram X₂[j] when computing pairs of steps (S10, S20) and for each histogram X₁[i] when computing pairs of steps (S10a, S20a). As noted earlier, this is preferably achieved using linear algebra primitives of GPUs, so as for execution of each branch to be distributed across a cluster of GPUs.

The complexity of the algorithm described in FIG. 2 can easily be inferred from earlier considerations. This algorithm too shows benefits when large numbers of histograms are involved. Yet, it necessarily has the same drawbacks as the algorithm of FIG. 1 when smaller numbers of histograms are involved (e.g., less than ˜10 000, on average, per set Xₖ). Thus, another class of embodiments is now discussed, in reference to FIG. 3, which addresses a case where one of the sets is large compared to the other.

Basically, such embodiments involve a quadratic-complexity RWMD process, applied subsequently to core steps of a method such as described in FIG. 1. Yet, the additional RWMD process is applied to compute the cost of moving histograms X₂[j] of the second set into a reduced version of X₁, so as to maintain a low complexity.

The description of the following procedure assumes that, given two sets X₁, X₂ of histograms, X₁ is the larger set. As seen in FIG. 3, steps S1-S26 are identical to steps S1-S26 of FIG. 1. At step S26, the distances resulting from the two-phase process S10, S20 are stored, for each j, in a first data structure, which is representable as a vector D₁[:, j]. The latter corresponds to the j^(th) column of the matrix D₁, computed column-by-column, as described earlier. This, incidentally, makes it here again possible to parallelize the process of FIG. 3. The dimension of D₁[:, j] is equal to N₁, i.e., each entry in D₁[:, j] is associated with a histogram of the set X₁.

Next, D₁[:, j] is sorted S28 in ascending order of the distances stored therein. That is, the first (or top) entry (or element) of D₁[:, j] corresponds to the smallest distance stored in D₁[:, j]. Then, the sorted vector is cropped S30b, such that a first, cropped data structure D^(c)₁[:, j] is obtained, wherein only the K first entries of the previously sorted vector are retained. The K first entries retained correspond to the K smallest distances of the sorted vector. Following this, a reduced set X^(c)₁ of histograms associated with D^(c)₁[:, j] is identified S31b. This can be achieved based on the histogram indices corresponding to the distances stored in D^(c)₁[:, j]. Thus, the reduced set X^(c)₁ corresponds to a reduced version of X₁ (i.e., it comprises only a small portion of the initial histograms of X₁).

Now, a quadratic-complexity RWMD process is implemented S32b, so as to compute a cost of moving the j^(th) histogram X₂[j] of X₂ into the reduced set X^(c)₁ of histograms. This gives rise to a second data structure D^(c)₂[:, j]. Finally, the quantity D^(c)[:, j] = max(D^(c)₁[:, j], D^(c)₂[:, j]) is computed S33b, as sketched below. This quantity is a variant to the matrix D obtained at step S33 of FIG. 2, which can advantageously serve as a basis for subsequent management steps S40, wherein distances between pairs of documents need be assessed.
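The following sketch summarizes steps S28-S33b in the same NumPy setting as before; rwmd_quadratic is a hypothetical helper standing for the quadratic-complexity RWMD of sect. 2.1, and the function name and signature are illustrative.

```python
import numpy as np

def refine_column(D1_col, X1, X2, W, j, K, rwmd_quadratic):
    """FIG. 3 variant (sketch): refine the K best candidates for X2[j]."""
    # S28-S30b: sort D1[:, j] in ascending order, keep the K smallest.
    top = np.argsort(D1_col)[:K]      # indices of the reduced set X1^c
    # S31b-S32b: quadratic RWMD from X2[j] to the reduced set X1^c.
    D2_col = np.array([rwmd_quadratic(X2, j, X1, i, W) for i in top])
    # S33b: element-wise maximum of the two lower bounds.
    return top, np.maximum(D1_col[top], D2_col)
```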

As the above variant applies the RWMD method to a reduced set X^(c)₁ of histograms, it typically provides improved execution performance, compared with the bare (quadratic) RWMD process. For example, execution performance can typically be improved by two orders of magnitude when keeping the top 1% histograms of X₁.

As apparent from FIG. 3, this variant does not compute the full matrix D, contrary to the algorithm of FIG. 2. Instead, it computes an approximate version of the so-called R matrix directly, see sect. 2, using one column of D₁ at a time. Because of the reduced set X^(c)₁ considered, some accuracy might be lost in the process. To mitigate this, one may increase the number of entries retained in D^(c)₁[:, j]. In practice, if the larger set X₁ has, e.g., millions of histograms, keeping the top 1-2% distances happens to produce near-optimal results. When X₁ is smaller (e.g., around 10 000 histograms), one may need to consider a higher percentage (e.g., up to 10%). This, however, impacts the residual improvement. Even so, a substantial improvement can still be obtained (e.g., increasing the percentage of kept histograms of X₁ to 10% still allows the execution time to be improved by one order of magnitude).

FIG. 7 depicts a general computerized unit 101, which can advantageously be used to implement the present methods. Such a unit 101 notably comprises CPUs and/or GPUs configured for enabling parallelization of computerized steps (e.g., two-phase process S10, S20 and/or S10a, S20a) as involved in embodiments. Yet, the present methods may also involve virtual machines, e.g., in the cloud, dedicated to the large computations. Additional details are discussed in sect. 3.

Finally, the present invention may also be embodied as a computer program product. This program may for instance be run (at least partly) on a computerized unit 101 such as depicted in FIG. 7. This computer program product comprises a computer readable storage medium having program instructions embodied therewith, which program instructions are executable by a processing unit (e.g., such as units 105 in FIG. 7), to cause the latter to take steps according to the present methods. Aspects of the present computer programs are discussed in detail in sect. 3.1 and 3.2.

The above embodiments have been succinctly described in reference to the accompanying drawings and may accommodate a number of variants. Several combinations of the above features may be contemplated. Examples are given in the next section.

2. Detailed Description of Specific Embodiments

The purpose of this section is to provide a detailed description of a new algorithm for computing the Relaxed Word Mover's Distance, which reduces the average time complexity from quadratic to linear. In practice, the achieved reduction in complexity makes the high-quality search results offered by WMD and RWMD applicable to massive datasets. Additional aspects discussed in this section relate to:

Massively parallel GPU implementations and scalable distributed implementations targeting clusters of GPUs;

Formal complexity analysis, practical scalability evaluation, and comparisons with alternative methods;

Applications of the LC-RWMD algorithm for k-nearest-neighbors search and classification-based tasks;

Extensive comparisons of prevalent text similarity measures with respect to both runtime and search quality.

The known WMD and RWMD approaches are discussed in detail in the next section. The detailed discussion and present considerations of the WMD and RWMD approaches that follow include interpretations, comments, conclusions and implementations that originate from the present Inventor and were not included in the original papers. Thus, aspects of this discussion and these considerations may not be considered as prior art for the present document.

2.1. Practical Implementations of the WMD and RWMD Methods

2.1.1 General Concepts

The Word Mover's Distance assesses the semantic distance between two documents. So, if two documents discuss the same topic, they will be assigned a low distance, even if they have no words in common. WMD consists of two components. The first is a vector representation of the words. The second is a distance measure to quantify the affinity between a pair of histograms, which in essence represent a so-called bag-of-words representation of each document in the new vector space. The vector representation creates an embedding space, in which semantically synonymous words will be close to each other. This is achieved using word2vec, which maps words into a vector space using a neural network. In fact, word2vec uses the weight matrix of the hidden layer of linear neurons as the vector representation of the words from a given vocabulary. The distance between two documents is calculated as the minimum cumulative distance of the words from the first document to the words of the second document. This optimization problem is well studied in transportation theory and is called the Earth Mover's Distance (EMD). Each document is represented as a histogram of words and each word as a multi-dimensional vector. E.g., under word2vec each word may be represented as a 300-dimensional vector.

Given two high-dimensional histograms P and Q, EMD attempts to discover the flow that minimizes the overall cost:

${{EMD}\left( {P,Q} \right)} = {{\min \; {\sum_{i,j}{f_{ij}{d_{ij}/{\sum_{i,j}{f_{ij}\mspace{14mu} {s.t.\mspace{14mu} f_{ij}}}}}}}} \geq {0{\sum\limits_{j}^{\;}f_{ij}}} \leq {P_{i}{\sum\limits_{i}^{\;}f_{ij}}} \leq Q_{i}}$${\sum\limits_{j}^{\;}{f_{ij}{\min\left( {{\sum\limits_{i}^{\;}P_{i}},{\sum\limits_{j}^{\;}Q_{j}}} \right)}}},$

where f_(ij) represents the flow, i.e., how much of the word i in P has to flow to the word j in Q. EMD also has a probabilistic interpretation and is conceptually equivalent to the so-called Mallows distance on probability distributions [4].

When P and Q are L1-normalized, the formulation of the EMD can be simplified as follows:

EMD(P, Q) = min  ∑_(i, j)f_(ij)d_(ij)  s.t.  f_(ij) ≥ 0${\sum\limits_{j}^{\;}f_{ij}} \leq {P_{i}{\sum\limits_{i}^{\;}f_{ij}}} \leq Q_{i}$

Such a simplification is possible because L1-normalization ensures the following:

∑_(i)P_(i) = 1 ${\sum\limits_{j}^{\;}Q_{j}} = 1$ ∑_(i, j)f_(ij) = 1

In the remainder of this text, we assume that histograms are L1-normalized (e.g., by using a preprocessing step), which enables us to use the simplified formulation of the EMD problem.

The combination of word embeddings and EMD forms the core of the WMD distance measure. In [2], WMD has been shown to outperform seven state-of-the-art baselines in terms of the k-nearest-neighbor classification error across several text corpora. This is because WMD can capture linguistic similarities of semantic and syntactic nature and learns how different writers may express the same viewpoint or topic, irrespective of whether they use the same words or not.

2.1.2 Complexity of WMD and RWMD Methods

Assume we are given two sets of histograms X₁ and X₂, and a vocabulary of size |V|. The sets X₁ and X₂ can be seen as sparse matrices, wherein each row is a sparse vector of dimension |V|. Each row X[i] represents a histogram that is extracted from a text document and stores the weights (e.g., term frequencies) of the unique words in that document. The popular compressed sparse rows (csr) representation is a convenient way of storing the sparse matrices X₁ and X₂.
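For illustration, a minimal sketch of how such csr matrices may be built and L1-normalized (cf. sect. 2.1.1), assuming tokenized documents and a given word-to-index map; the names and the placement of the normalization step are illustrative.

```python
from collections import Counter
import numpy as np
from scipy.sparse import csr_matrix

def build_histograms(docs, word_to_id):
    """Build the csr histogram matrix X: one L1-normalized row per document."""
    indptr, indices, data = [0], [], []
    for tokens in docs:
        counts = Counter(word_to_id[t] for t in tokens if t in word_to_id)
        for w, c in sorted(counts.items()):   # csr expects sorted columns
            indices.append(w)
            data.append(c)
        indptr.append(len(indices))
    X = csr_matrix((data, indices, indptr),
                   shape=(len(docs), len(word_to_id)), dtype=np.float64)
    # L1-normalize each row (assumes every document has >= 1 known word).
    row_sums = np.asarray(X.sum(axis=1)).ravel()
    return csr_matrix(X.multiply(1.0 / row_sums[:, None]))
```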

Assume that the sparse matrix X₁ has N₁ rows and the sparse matrix X₂ has N₂ rows, and that we are further given a dense matrix W, which stores the vector representations of the words that belong to the vocabulary V. Basically, for each word w ∈ {1 . . . |V|} in the vocabulary V, W[w] stores a vector of M floating point numbers given by some suitable embedding process (e.g., Word2Vec in the case of WMD). Given X₁, X₂, and W, the goal is to compute the distance between each pair of histograms (X₁[i], X₂[j]), i ∈ {1 . . . N₁}, j ∈ {1 . . . N₂}.

The results of WMD and RWMD can be stored in an N₁×N₂ matrix D. Given a histogram X₂[j], the top-K closest histograms in X₁ may be computed, and their positions in X₁ as well as the respective distance values may be stored in an N₂×K matrix R. The matrix D is typically computed one column or one row at a time, and the top-K results in each column or row are stored in the matrix R.

Given two histograms X₁[i], i ∈ {1 . . . N₁}, and X₂[j], j ∈ {1 . . . N₂}, the first step of WMD is to gather the vector representations of the words in X₁[i] and X₂[j] from the matrix W. Assume that the number of nonzeros in X₁[i] is d₁ and the number of nonzeros in X₂[j] is d₂. The dense matrix T₁[i] stores the vector representations of the words in X₁[i] and has d₁ rows and M columns. Similarly, the dense matrix T₂[j] stores the vector representations of the words in X₂[j] and has d₂ rows and M columns.

Once T₁[i] and T₂[j] are constructed, Euclidean distances between all pairs of word vectors are computed. As indicated in sect. 1, the Euclidean distance between word vectors W[u] and W[v] is given by ∥W[u]−W[v]∥₂, and can be calculated as follows, where ∥·∥₂ is the Euclidean norm and ^(T) denotes a transpose:

$$\|W[u] - W[v]\|_2 = \left(\|W[u]\|_2^2 + \|W[v]\|_2^2 - 2\, W[u]\, W[v]^T\right)^{1/2} \tag{1}$$

Table I below aggregates notations as used herein.

TABLE I
Data structures used in WMD-related computations

Term   Type       Description
W      Input      Word vectors for the vocabulary
X      Input      A set of histograms (e.g., csr matrix)
T[i]   Auxiliary  Word vectors for histogram i
F[i]   Auxiliary  Word frequencies in histogram i
D      Output     Distance matrix
R      Output     Top-K results

As one may realize, the Euclidean distances between all pairs of word vectors in T₁[i] and T₂[j] form a d₁×d₂ dense matrix denoted as T₁[i] ∘ T₂[j]^(T). Here, the ∘ operation is similar to a matrix multiplication operation, but instead of computing dot products, Euclidean distances are computed between the word vectors.

In practice, T₁[i] ∘ T₂[j]^(T) can be computed by:

-   a. Calculating the Euclidean norm of each row of T₁[i];
-   b. Calculating the Euclidean norm of each row of T₂[j];
-   c. Multiplying T₁[i] by the transpose of T₂[j]; and
-   d. Applying Eq. (1) above independently between each row of T₁[i] and each row of T₂[j].

One may verify that the complexity of the ∘ operation across two matrices, each with O(d) rows and O(M) columns, is determined by the matrix multiplication operation and is given by O(d²M).
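A compact sketch of steps a-d in the same NumPy setting as before (pairwise_euclidean is an illustrative name; the max with zero merely guards against round-off in step d):

```python
import numpy as np

def pairwise_euclidean(T1, T2):
    """Compute T1 ∘ T2^T: Euclidean distances between all rows of
    T1 (d1 x M) and T2 (d2 x M), via norms and one matrix product."""
    sq1 = np.sum(T1 * T1, axis=1)[:, None]   # a. squared norms of T1 rows
    sq2 = np.sum(T2 * T2, axis=1)[None, :]   # b. squared norms of T2 rows
    cross = T1 @ T2.T                        # c. dot products, O(d1 d2 M)
    return np.sqrt(np.maximum(sq1 + sq2 - 2.0 * cross, 0.0))  # d. Eq. (1)
```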

Solving the WMD problem: Suppose that for each X₁[i], i ∈ {1 . . . N₁}, a dense representation F₁[i], i ∈ {1 . . . N₁} is constructed. In other words, if the number of nonzeros in X₁[i] is d₁, the size of F₁[i] is also d₁. Similarly, for each X₂[j], j ∈ {1 . . . N₂}, a dense representation F₂[j], j ∈ {1 . . . N₂} is constructed. If the number of nonzeros in X₂[j] is d₂, then the size of F₂[j] is d₂. EMD is computed based on F₁[i], F₂[j], and T₁[i] ∘ T₂[j]^(T) (see FIG. 5). The notation introduced so far is summarized in Table I (above) and Table II (below).

TABLE II
Parameters used in WMD-related computations

Parameter  Description
N          Average number of histograms
|V|        Size of the vocabulary
M          Size of the word vectors
d          Size of a histogram
K          Used in top-K calculation

EMD constructs a bipartite graph: one side of the graph has d₁ nodes and the other side d₂ nodes. There are in total d₁×d₂ edges. The weights of the nodes of the bipartite graph are given by F₁[i] and F₂[j], respectively. The costs associated with the edges are given by T₁[i] ∘ T₂[j]^(T), that is, the Euclidean distance from one vector to the other. EMD solves a minimum cost flow problem on the resulting graph.

Assuming that the size of F₁[i] and F₂[j] is O(d), the complexity of the EMD computation is O(d³ log(d)). Then, the overall complexity of the WMD computation between a pair of histograms is O(d²M + d³ log(d)). If N₁ and N₂ are both O(N), the complexity of computing WMD across all pairs of histograms is O(N²d²M + N²d³ log(d)). So, the complexity of WMD grows quadratically in N (the number of documents), and cubically in d (the average size of histograms).

Solving the RWMD problem: The RWMD provides a lower-bound approximation of the WMD. Similar to WMD, the first step of RWMD is the computation of T₁[i] ∘ T₂[j]^(T). Because T₁[i] has d₁ rows and T₂[j] has d₂ rows, T₁[i] ∘ T₂[j]^(T) is a matrix of dimensions d₁×d₂. The second step of RWMD is the computation of the minimum value of each row of T₁[i] ∘ T₂[j]^(T), which produces a floating-point vector of dimension d₁. The third and final step of RWMD is a dot-product operation between F₁[i] and the result of the second step, which produces a single floating-point value.

Because the RWMD computation is not symmetric, it is of advantage to perform it twice, swapping T₁[i] with T₂[j] and F₁[i] with F₂[j] when repeating the process. The symmetric lower bound is not necessarily equal to the first lower bound, and computing the maximum of these two bounds provides an even tighter lower bound for WMD. In practice, it may be realized that it is not necessary to compute T₂[j] ∘ T₁[i]^(T) explicitly because this matrix is the transpose of T₁[i] ∘ T₂[j]^(T). It is sufficient to compute the minimum value in each column of T₁[i] ∘ T₂[j]^(T), and multiply the result by F₂[j]^(T) to produce the symmetric lower-bound approximation.

Finally, the complexity of the overall RWMD computation is determined by the complexity of computing T₁[i] ∘ T₂[j]^(T), which is O(d²M). When computed across two sets of size O(N) each, the overall complexity is O(N²d²M), which is significantly lower than the complexity of WMD. However, the complexity of RWMD still grows quadratically with both N and d, which can be prohibitive when N and d are large.

Speeding-up WMD using pruning: A pruning technique was given in [2] to speed up the WMD computation, which uses the property that RWMD is a tight lower bound for WMD. Given a histogram X₂[i], first RWMD is computed between X₂[i] and all histograms in X₁. Given a user-defined parameter K, the top-K closest histograms in X₁ are identified based on the RWMD distances. After that, WMD is computed between X₂[i] and the top-K closest histograms in X₁. The highest WMD value computed by this step provides a cut-off value C. WMD is computed between X₂[i] and the remaining histograms in X₁ only if the pairwise RWMD value is lower than C. All other histograms are pruned because they cannot be part of the top-K results of WMD. Naturally, when K is small, C also becomes small, which leads to more effective pruning.
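A sketch of this pruning scheme, with rwmd_row and wmd_pair as hypothetical callables (RWMD from X₂[i] to all of X₁, and WMD between one pair, respectively); the incremental tightening of C is a straightforward refinement, not something prescribed by [2]:

```python
import numpy as np

def top_k_wmd_pruned(i2, K, rwmd_row, wmd_pair):
    """Exact top-K WMD neighbors of X2[i2] in X1, pruning with RWMD."""
    lower = rwmd_row(i2)                    # RWMD lower bounds, size N1
    order = np.argsort(lower)               # candidates, closest first
    exact = {i: wmd_pair(i2, i) for i in order[:K]}
    C = max(exact.values())                 # cut-off value
    for i in order[K:]:
        if lower[i] >= C:                   # bound above cut-off: this and
            break                           # all later candidates are pruned
        exact[i] = wmd_pair(i2, i)
        exact = dict(sorted(exact.items(), key=lambda kv: kv[1])[:K])
        C = max(exact.values())             # tighten the cut-off
    return exact                            # K closest histograms by WMD
```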

Other methods: The so-called word centroid distance (WCD) method provides another approximation of WMD, wherein a single vector, called the centroid, is computed for each histogram X[i] by simply multiplying X[i] and W, see ref. [2]. This operation, essentially, computes a weighted average of all the embedding vectors associated with the words that are in X[i]. The WCD between two histograms X₁[i] and X₂[j] is given by the Euclidean distance between the respective centroids. The complexity of computing all the centroids is then O(NdM) and the complexity of computing all the distances across two sets of size O(N) each is O(N²M). When N >> d, the overall complexity becomes O(N²M). Although WCD has much lower complexity than RWMD, it is known to be a very crude approximation and is not regarded as a tight lower bound of WMD. Therefore, WCD is generally not suitable for applications that require a high accuracy.
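In the same sketch setting (reusing the illustrative pairwise_euclidean above), WCD reduces to:

```python
# Centroids: one sparse-matrix, dense-matrix product per set.
C1 = X1 @ W                         # N1 x M weighted-average vectors
C2 = X2 @ W                         # N2 x M
D_wcd = pairwise_euclidean(C1, C2)  # N1 x N2 centroid distances
```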

Cosine similarity is a popular way of producing similarity scores across documents. Negating a similarity score is sufficient to transform it into a distance metric. Cosine similarity between two histograms X₁[i] and X₂[j] can be computed simply by a dot-product operation, which has complexity O(d). The complexity of computing all pairwise similarity scores across two sets of size O(N) each is then O(N²d). Thus, cosine similarity has a much lower complexity than RWMD. However, it is neither a lower bound of WMD, nor can it take advantage of word embeddings.
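For completeness, a sketch of this baseline on csr inputs (rows are assumed non-empty; the final negation follows the remark above):

```python
import numpy as np
from scipy.sparse import csr_matrix

def cosine_distances(X1, X2):
    """All pairwise cosine distances: L2-normalize the sparse rows,
    then one sparse-sparse matrix product of dot products."""
    def l2_rows(X):
        norms = np.sqrt(np.asarray(X.multiply(X).sum(axis=1))).ravel()
        return csr_matrix(X.multiply(1.0 / norms[:, None]))
    sims = l2_rows(X1) @ l2_rows(X2).T      # N1 x N2 cosine similarities
    return -sims.toarray()                  # negated similarity as distance
```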

2.2. Linear-Complexity Implementation of the RWMD Method (According to Embodiments)

A main disadvantage of the quadratic RWMD method is that it computes the distances between the same pair of words multiple times. This overhead could be eliminated by precomputing the distances between all possible pairs of words in the vocabulary in advance. However, typical vocabulary sizes are in the order of a few million terms, which could, for instance, include company or person names in a business analytics setting. Storing distances across millions of terms requires tens of terabytes of storage space. In addition, accesses to this storage would be completely random, rendering software parallelization and hardware acceleration impractical. Second, even if we assume that the distances across all words in the vocabulary are pre-computed and stored in a large and fast storage medium that enables parallel random accesses, if a given word w appears in histograms X₁[i], i ∈ {1 . . . N₁} n times, the quadratic RWMD method computes the word closest to w (i.e., the minimum distance) in each of the histograms X₂[j], j ∈ {1 . . . N₂} exactly n times. Such redundancy again leads to a quadratic time complexity.

Thus, a new method is proposed for computing RWMD, which addresses each of the above problems and reduces the average time complexity to linear. This approach not only reduces the computational complexity, but also reduces the storage requirements with respect to the quadratic RWMD. The new approach further maps well onto linear algebra primitives supported by modern GPU programming infrastructures, and requires a limited amount of working memory. Thus, it is very suitable for hardware acceleration. Reducing the time complexity of RWMD from quadratic to linear further makes it possible to compute pairwise distances across (i) large sets of documents (e.g., millions of documents) and (ii) large histograms (e.g., millions of entries in histograms).

The present Inventor has implemented this linear-complexity method and concludes that it is the only practical way of computing RWMD between all pairs across millions of documents, which leads to trillions of RWMD computations. Notably, he arrived at orders of magnitude performance improvements with respect to a concurrent implementation of the quadratic-complexity RWMD method of Kusner et al. [2].

As described in the previous section, a focal point of the present work is a novel decomposition of the RWMD computation into two phases, where each phase has reduced complexity in terms of the size of the histograms in X₁ and X₂. In the LC-RWMD implementation discussed herein, each phase has exactly linear complexity in terms of the size of the histograms.

That is, for a given histogram X₂[j], j ∈ {1 . . . N₂}, the first phase computes the Euclidean distance to the closest entry in histogram X₂[j] for each word in the vocabulary, which produces y, a dense floating-point vector of dimension |V| (see FIG. 5A). The second phase performs a sparse-matrix, dense-vector multiplication between X₁ and y, which produces the RWMD between all X₁[i], i ∈ {1 . . . N₁} and X₂[j] (see FIG. 5B).

The first phase is similar to the quadratic-complexity RWMD implementation described above in that it treats the vocabulary V as a single histogram and computes pairwise Euclidean distances across all the words in the vocabulary and the words in a given histogram X₂[j] by computing W ∘ T₂[j]^(T). Next, row-wise minimums are computed to derive the minimum distances between the words in the vocabulary and the words in X₂[j], and the results are stored in the vector y of size |V|. The complexity of this phase is determined by the ∘ operation and is given by O(|V|dM).

In the second phase, X₁ is multiplied by y to compute the RWMD. This phase is essentially a sparse-matrix, dense-vector multiplication operation. For each X₁[i], i ∈ {1 . . . N₁}, this phase gathers the minimum distances from y based on the positions of the words in X₁[i] and then computes a dot-product with F₁[i]. Therefore, the overall functionality is equivalent to the quadratic-complexity RWMD given earlier. Note that the second phase may compute distances across all histograms in X₁ and a single histogram from X₂ in parallel, hence its time complexity is O(Nd). A main advantage of the present approach is that the relatively high cost of the first phase is amortized over a large number of histograms X₁[i], i ∈ {1 . . . N₁} in the second phase.

The overall complexity of the linear-complexity RWMD (hereafter LC-RWMD) algorithm when comparing a single histogram from X₂ against all histograms in X₁ is then O(|V|dM + Nd). The overall complexity when comparing all histograms of X₂ against all histograms of X₁ is O(N|V|dM + N²d). When N and |V| are of the same order of magnitude, the overall complexity becomes O(N²dM). Even though this complexity still grows quadratically with the number of histograms (i.e., N), it grows linearly with the average size of the histograms (i.e., d).

The LC-RWMD algorithm can be adapted to compare several histograms from X₂ with all the histograms in X₁. In this case, the first phase of LC-RWMD computes the word with the minimum distance in each histogram of X₂ for each word in the vocabulary. Assume that s histograms from X₂ are used in the comparison. The result of the first phase is then a matrix Y with |V| rows and s columns. In the second phase, a sparse-matrix, dense-matrix multiplication is performed between X₁ and Y, which produces a distance matrix with N₁ rows and s columns.
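A sketch of this batched variant, in the same illustrative NumPy/SciPy setting as before:

```python
import numpy as np
from scipy.spatial.distance import cdist

def lc_rwmd_batch(X1, X2, W, batch):
    """Compare the histograms X2[j], j in batch, against all of X1."""
    # Phase 1: one column of Y (size |V|) per histogram in the batch.
    Y = np.stack([cdist(W, W[X2[j].indices]).min(axis=1) for j in batch],
                 axis=1)                   # |V| x s
    # Phase 2: one sparse-matrix, dense-matrix product.
    return X1 @ Y                          # N1 x s distance matrix
```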

The algorithm described so far computes the costs of moving histograms X₁[i], i ∈ {1 . . . N₁} to histograms X₂[j], j ∈ {1 . . . N₂}, as otherwise illustrated in the flowchart of FIG. 1. Assume that these results are stored in an N₁×N₂ matrix D₁. To achieve a tight lower bound for WMD, the costs of moving the histograms X₂[j], j ∈ {1 . . . N₂} to X₁[i], i ∈ {1 . . . N₁} may further be computed, as illustrated in the flowchart of FIG. 2. Therefore, after computing D₁ in full, one may swap X₁ and X₂, and run the LC-RWMD algorithm once again, which produces an N₂×N₁ matrix D₂. The final distance matrix D is produced by computing the maximum values of the symmetric entries in D₁ and D₂ (i.e., D = max(D₁, D₂^(T))).

When O(N) = O(|V|), the LC-RWMD implementation reduces the average complexity of computing RWMD between a pair of histograms from O(d²M) to O(dM). Such a reduction makes LC-RWMD orders of magnitude faster than the quadratic RWMD. More generally, the reduction in complexity is given by min(Nd/|V|, dM), which is typically determined by the first term, i.e., Nd/|V|. As a result, using the LC-RWMD method has significant advantages when the size of the histograms d is large and/or the number of histograms N is larger than or equal to the number of words |V| in the vocabulary. In general, the present methods will provide a net advantage over prior-art methods when the factor Nd/|V| is larger than 1.

Next, an important optimization can be achieved for the LC-RWMD implementations by reducing the size of the vocabulary, which is achieved by eliminating words that do not appear in X₁. Similarly, when X₁ and X₂ are swapped to refine the distance results as described above, one may eliminate the words that do not appear in X₂ from the vocabulary.

2.3. Parallel Implementations of the LC-RWMD Method

The present Inventor has developed GPU-accelerated and distributed implementations of the linear-complexity RWMD algorithm as well as the more straightforward quadratic-complexity approach, and demonstrated excellent scalability for both.

Namely, one may parallelize the computation of a complete column of the pairwise distance matrix D, by storing all T₁[i], i ∈ {1 . . . N₁} matrices in the GPU memory, and by performing pairwise RWMD computations across all T₁[i], i ∈ {1 . . . N₁} and a single T₂[j], j ∈ {1 . . . N₂} at once. The RWMD computations are performed using an adaptation of the implementation discussed in sect. 1. The Euclidean distance computations across all T₁[i], i ∈ {1 . . . N₁} matrices and a single T₂[j], j ∈ {1 . . . N₂} are performed by combining all T₁[i], i ∈ {1 . . . N₁} in a single matrix T₁ with O(Nd) rows and M columns, and by using NVIDIA's CUBLAS library for multiplying it against T₂[j]. The row-wise and column-wise minimum operations are performed using NVIDIA's Thrust library. The dot-product operations are performed again using CUBLAS. Such an implementation requires O(NdM) space in the GPU memory because each T₁[i] matrix requires O(dM) space, and we store O(N) such matrices simultaneously in the GPU memory.

In the first phase, the implementation of LC-RWMD uses the CUBLAS library for computing Euclidean distances between W and a single T₂[j], j ∈ {1 . . . N₂}; the Thrust library is then used for computing row-wise minimums. In the second phase, NVIDIA's CUSPARSE library can be used for multiplying X₁ by the result of the first phase. Unlike the quadratic-complexity implementation, the linear-complexity RWMD does not store the T₁[i], i ∈ {1 . . . N₁}, matrices in the GPU memory. Instead of allocating the dense T₁ matrix that stores embedding vectors for all the words in all the histograms in X₁, LC-RWMD simply stores the sparse matrix X₁, which requires O(Nd) space, and the embedding vectors for the complete vocabulary (i.e., W), which requires O(|V|M) space. Therefore, the overall space complexity of LC-RWMD is O(Nd+|V|M). When O(|V|)=O(N), the overall space complexity becomes O(Nd+NM). As a result, the space complexity of LC-RWMD is smaller than that of the quadratic-complexity RWMD by a factor of min(d, M). In conclusion, LC-RWMD not only reduces the time complexity by a factor of d with respect to the quadratic RWMD, but also reduces the space complexity by a similar factor.
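Phase one for a single transient histogram can then be sketched as follows, mirroring the CUBLAS-plus-Thrust pipeline described above (gemm-based distances followed by a row-wise minimum). pairwiseSqDist is the illustrative routine from the previous sketch, and T₂[j] is assumed stored as a d₂×M row-major array:

    // LC-RWMD phase one for one transient histogram: y[v] is the distance
    // from vocabulary word v to its closest word in X2[j].
    #include <algorithm>
    #include <cfloat>
    #include <cmath>
    #include <cstddef>
    #include <vector>

    std::vector<float> phaseOneSingle(const std::vector<float>& W, std::size_t V,
                                      const std::vector<float>& T2j,
                                      std::size_t d2, std::size_t M) {
        std::vector<float> dist = pairwiseSqDist(W, V, T2j, d2, M); // V x d2
        std::vector<float> y(V, FLT_MAX);
        for (std::size_t v = 0; v < V; ++v)      // row-wise minimum (the role
            for (std::size_t j = 0; j < d2; ++j) // played by Thrust on GPU)
                y[v] = std::min(y[v], dist[v * d2 + j]);
        for (auto& e : y) e = std::sqrt(e);      // back to Euclidean distance
        return y;
    }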

Table III below gives a summary of these results.

TABLE III
Comparison of complexities of the LC-RWMD and RWMD

  Method      Time Complexity        Space Complexity
  LC-RWMD     O(N|V|dM + N²d)        O(Nd + |V|M)
  RWMD        O(N²d²M)               O(NdM)
  Reduction   O(min(Nd/|V|, dM))     O(min(Nd/|V|, M))

Finally, all the algorithms (as described in this section) are data-parallel and can easily be distributed across several GPUs. Spreading either X₁ or X₂ across several GPUs is sufficient. The only function that may require communication across multiple GPU nodes in a distributed setting is the top-K computation. However, the respective communication overhead is typically marginal compared to the cost of the distance computation algorithms.
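For illustration, the cross-node part of the top-K computation might reduce to merging per-node candidate lists, as in the following sketch (the per-node lists and their layout are assumptions made for illustration only):

    // Merge per-node top-K candidate lists into a global top-K
    // (smallest distances first).
    #include <algorithm>
    #include <cstddef>
    #include <vector>

    std::vector<float> mergeTopK(const std::vector<std::vector<float>>& perNode,
                                 std::size_t K) {
        std::vector<float> all;
        for (const auto& v : perNode)            // gather all candidates
            all.insert(all.end(), v.begin(), v.end());
        const std::size_t k = std::min(K, all.size());
        std::partial_sort(all.begin(), all.begin() + k, all.end());
        all.resize(k);                           // keep the K smallest
        return all;
    }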

2.4. Performance Evaluation

Table IV below summarizes the characteristics of the datasets used in the present experiments in terms of the number of documents (N) and the average number of unique words per document excluding the stop-words, which indicates the average histogram size (d). Both datasets are proprietary to the Applicant.

The present algorithms require two datasets for comparison. The first set X₁ is resident (i.e., fixed) and the second set X₂ is transient (i.e., unknown). In the present experiments, the datasets given in Table IV are treated as resident sets. Given a second set of documents (i.e., X₂), one may compare it against X₁. In typical use cases, X₁ and X₂ are completely independent sets, e.g., one is the training set and the other is the test set. However, in the experiments presented, X₂ is defined as a randomly selected subset of X₁ for the sake of simplicity. The experiments presented in this section use embedding vectors generated by word2vec; the vocabulary size is three million words (i.e., |V|=3 000 000), and each word vector is composed of 300 single-precision floating-point numbers (i.e., M=300). However, the number of embedding vectors the LC-RWMD algorithm stores in the GPU memory is given by the number of unique words that are found in the resident dataset. This number is called V_(eff); the respective values are shown in Table IV.

The present algorithms have been deployed on a cluster of four IBM POWER8+ nodes, where each node had two-socket CPUs with 20 cores and 512 GB of CPU memory. In addition, each node had four NVIDIA Tesla P100 GPUs attached via NVLINK interfaces. Each P100 GPU has 16 GB of memory, half of which was used to store the resident data structures, and the rest for the temporary data structures. The POWER8+ nodes were connected via 100 Gb Infiniband interfaces. All code was written in C++; version 14.0.1 of IBM's XL C/C++ compiler was used to build the code, CUDA 8.0 was used to program the GPUs, and IBM's Spectrum MPI was used for inter-node and inter-process communication.

TABLE IV
Characteristics of the datasets used in experiments

  Data set   N      d (average)   V_(eff)
  BDS News   1M     107.5         452058
  CDS News   2.8M   27.5          292492

Note that the first step of WMD is to compute the Euclidean distances. This step too can be GPU-accelerated. The second step uses a recent EMD library [5], which offers a single-threaded CPU implementation. However, it is possible to run multiple EMD computations in parallel, across different pairs of histograms, using multiple CPU threads.
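The following sketch illustrates one way such a parallelization could be organized. emdDistance is a hypothetical stand-in for the EMD library call of [5], and only the parallelization pattern is of interest here; a production version would use a bounded thread pool rather than one asynchronous task per pair:

    // Run the single-threaded EMD kernel across many histogram pairs
    // in parallel with CPU threads.
    #include <cstddef>
    #include <future>
    #include <utility>
    #include <vector>

    float emdDistance(std::size_t i, std::size_t j) {
        (void)i; (void)j;     // placeholder for the actual EMD computation
        return 0.f;
    }

    std::vector<float>
    parallelEmd(const std::vector<std::pair<std::size_t, std::size_t>>& pairs) {
        std::vector<std::future<float>> jobs;
        for (const auto& p : pairs)          // one EMD task per histogram pair
            jobs.push_back(std::async(std::launch::async, emdDistance,
                                      p.first, p.second));
        std::vector<float> out;
        out.reserve(jobs.size());
        for (auto& j : jobs) out.push_back(j.get());
        return out;
    }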

Two datasets (hereafter called BDS and CDS) were considered, each related to financial portal news, for the purpose of testing the present methods. The results obtained show that the current implementation of WMD (which implementation, again, is not prior art per se) is two to three orders of magnitude slower than the single-GPU implementation of the quadratic-complexity RWMD (which implementation is not prior art either), when applied to the CDS dataset. As expected, the factor of slowdown increases with K, which has a significant impact on the efficiency of pruning. When K=N, there is effectively no pruning. The observed slowdown was much higher for the BDS dataset because of the larger average histogram size of this dataset. In fact, using the BDS dataset, the present Inventor was not able to collect WMD results for K>100, as this would require several weeks of execution time. As one realizes, such a high cost renders WMD impractical for real-life applications.

The ratio of overlap between the top-K results of WMD and the top-K results of RWMD was investigated. The results obtained have shown that the RWMD ratio of overlap varies between 0.72 and 1, illustrating that RWMD can be a high-quality approximation to WMD, at least in the implementation adopted here. By contrast, WCD overlap ratios were found to be as low as 0.13, which tends to confirm that WCD is not a high-quality approximation to WMD.

Finally, a preferred implementation of the LC-RWMD method was compared with an implementation of RWMD having quadratic complexity, in terms of runtime performance. Notably, the average times to compute the distances between a single transient histogram and, on the one hand, one million resident histograms from the BDS dataset and, on the other hand, 2.8 million resident histograms from the CDS dataset were investigated. As it turned out, the LC-RWMD was faster than the quadratic RWMD typically by a factor of d, as expected. Now, when running LC-RWMD on a single GPU, the time to compute one million distances was on the order of 100 milliseconds for the BDS dataset. For the CDS dataset, which has significantly smaller histograms, 2.8 million distances could be computed in only about 80 milliseconds. Both the quadratic RWMD and the LC-RWMD were shown to benefit from parallelization on multiple GPUs. Both are data-parallel algorithms and show linear speed-up as the number of GPUs increases. Note that either the resident dataset or the transient dataset can be distributed across multiple GPUs. With the quadratic RWMD, storage requirements per histogram are much higher, which means that fewer histograms can fit into the GPU memory. When the complete resident set does not fit into the GPU memory, the resident histograms need to be copied into the GPU memory in several chunks, which creates additional runtime overhead. To minimize this overhead, one may distribute the resident set when multiple GPUs are available. For instance, when using 16 GPUs, the CDS dataset completely fits into the distributed GPU memory, and the copy overheads are completely eliminated, which results in super-linear speedup for the quadratic RWMD. In the case of the LC-RWMD, the data structures are much more compact, and millions of histograms can be stored in the memory of a single GPU. When several GPUs are available, distributing the resident dataset does not typically result in a linear speed-up. However, distributing the transient dataset across several GPUs and replicating the resident dataset on all the GPUs leads to a linear speedup, especially if the transient dataset is much larger than the resident dataset.

2.5. Final Remarks

As disclosed in the present document, it is possible to design high-quality (RWMD-like) methods for assessing distances between pairs of documents, which can be efficiently implemented. In particular, embodiments of such methods, such as the LC-RWMD method, can be implemented in linear time (on average) when computing distances across large sets of documents. Practical implementations of the present methods were further shown to map well onto commonly used, dense and sparse linear algebra routines, and to execute efficiently on CPUs and/or GPUs. In addition, such implementations can be efficiently scaled out across several CPUs and/or GPUs, and exhibit perfect strong-scaling or weak-scaling behavior. Lastly, and as the present Inventor has further observed, the accuracy of the underlying, RWMD-like algorithm is preserved.

The LC-RWMD method presented applies row-wise or column-wise reductions using minimum operations. However, it is possible to use other types of reduction methods once the pairwise distances are computed between word vectors. For instance, the LC-RWMD can also use the maximum distance in each row/column, or the sum of all the distances in each row/column, as sketched below.
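A sketch of such a parameterized row-wise reduction, with the reduction type as an illustrative enumeration, might read:

    // Reduce each row of a V x d2 distance matrix by minimum, maximum,
    // or sum, as suggested above.
    #include <algorithm>
    #include <cfloat>
    #include <cstddef>
    #include <vector>

    enum class Reduction { Min, Max, Sum };

    std::vector<float> reduceRows(const std::vector<float>& dist, std::size_t V,
                                  std::size_t d2, Reduction r) {
        std::vector<float> y(V, r == Reduction::Min ? FLT_MAX
                                : r == Reduction::Max ? -FLT_MAX : 0.f);
        for (std::size_t v = 0; v < V; ++v)
            for (std::size_t j = 0; j < d2; ++j) {
                const float x = dist[v * d2 + j];
                if (r == Reduction::Min)      y[v] = std::min(y[v], x);
                else if (r == Reduction::Max) y[v] = std::max(y[v], x);
                else                          y[v] += x;
            }
        return y;
    }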

3. Technical Implementation Details

3.1 Computerized System and Devices

Computerized systems and devices can be suitably designed for implementing embodiments of the present invention as described herein. In that respect, it can be appreciated that the methods described herein are largely non-interactive and automated. In exemplary embodiments, the methods described herein can be implemented either in an interactive, partly-interactive or non-interactive system. The methods described herein can be implemented in software, hardware, or a combination thereof. In exemplary embodiments, the methods described herein are implemented in software, as an executable program, the latter executed by suitable digital processing devices. More generally, embodiments of the present invention can be implemented wherein virtual machines and/or general-purpose digital computers, such as personal computers, workstations, etc., are used.

For instance, the system depicted in FIG. 7 schematically represents a computerized unit 101, e.g., a general- or specific-purpose computer.

In exemplary embodiments, in terms of hardware architecture, as shown in FIG. 7, the unit 101 includes at least one processor 105, and a memory 110 coupled to a memory controller 115. Preferably though, several processors (CPUs, and/or GPUs) are involved, to allow parallelization, as discussed earlier. To that aim, the processing units may be assigned respective memory controllers, as known per se.

One or more input and/or output (I/O) devices 145, 150, 155 (or peripherals) are communicatively coupled via a local input/output controller 135. The input/output controller 135 can be coupled to or include one or more buses and a system bus 140, as known in the art. The input/output controller 135 may have additional elements, which are omitted for simplicity, such as controllers, buffers (caches), drivers, repeaters, and receivers, to enable communications. Further, the local interface may include address, control, and/or data connections to enable appropriate communications among the aforementioned components.

The processor(s) 105 is a hardware device for executing software, particularly that stored in memory 110. The processor(s) 105 can be any custom made or commercially available processor(s), may include one or more central processing units (CPUs) and/or one or more graphics processing units (GPUs), or, still, have an architecture involving auxiliary processors among several processors associated with the computer 101. In general, it may involve any type of semiconductor-based microprocessor (in the form of a microchip or chip set), or generally any device for executing software instructions.

The memory 110 can include any one or combination of volatile memory elements (e.g., random access memory) and nonvolatile memory elements. Moreover, the memory 110 may incorporate electronic, magnetic, optical, and/or other types of storage media. Note that the memory 110 can have a distributed architecture, where various components are situated remote from one another, but can be accessed by the processor(s) 105.

The software in memory 110 may include one or more separate programs, each of which comprises an ordered listing of executable instructions for implementing logical functions. In the example of FIG. 7, the software in the memory 110 includes computerized methods, forming part or all of the methods described herein in accordance with exemplary embodiments and, in particular, a suitable operating system (OS) 111. The OS 111 essentially controls the execution of other computer programs and provides scheduling, input-output control, file and data management, memory management, and communication control and related services.

The methods described herein (or part thereof) may be in the form of a source program, executable program (object code), script, or any other entity comprising a set of instructions to be performed. When in a source program form, the program needs to be translated via a compiler, assembler, interpreter, or the like, as known per se, which may or may not be included within the memory 110, so as to operate properly in connection with the OS 111. Furthermore, the methods can be written in an object-oriented programming language, which has classes of data and methods, or a procedural programming language, which has routines, subroutines, and/or functions.

Possibly, a conventional keyboard and mouse can be coupled to the input/output controller 135. Other I/O devices 140-155 may be included. The computerized unit 101 can further include a display controller 125 coupled to a display 130. In exemplary embodiments, the computerized unit 101 can further include a network interface or transceiver 160 for coupling to a network, to enable, in turn, data communication to/from other, external components.

The network transmits and receives data between the unit 101 and external devices. The network is possibly implemented in a wireless fashion, e.g., using wireless protocols and technologies, such as Wifi, WiMax, etc. The network may be a fixed wireless network, a wireless local area network (LAN), a wireless wide area network (WAN), a personal area network (PAN), a virtual private network (VPN), an intranet or other suitable network system, and includes equipment for receiving and transmitting signals.

The network can also be an IP-based network for communication between the unit 101 and any external server, client and the like via a broadband connection. In exemplary embodiments, the network can be a managed IP network administered by a service provider. Besides, the network can be a packet-switched network such as a LAN, WAN, Internet network, an Internet of things network, etc.

If the unit 101 is a PC, workstation, intelligent device or the like, the software in the memory 110 may further include a basic input output system (BIOS). The BIOS is stored in ROM so that the BIOS can be executed when the computer 101 is activated. When the unit 101 is in operation, the processor(s) 105 is(are) configured to execute software stored within the memory 110, to communicate data to and from the memory 110, and to generally control operations of the computer 101 pursuant to the software.

The methods described herein and the OS 111, in whole or in part, are read by the processor(s) 105, typically buffered within the processor(s) 105, and then executed. When the methods described herein are implemented in software, the methods can be stored on any computer readable medium, such as storage 120, for use by or in connection with any computer related system or method.

3.2 Computer Program Products

The present invention may be a system, a method, and/or a computer program product at any possible technical detail level of integration. The computer program product may include a computer readable storage medium (or media) having computer readable program instructions thereon for causing a processor to carry out aspects of the present invention.

The computer readable storage medium can be a tangible device that can retain and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but is not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. A non-exhaustive list of more specific examples of the computer readable storage medium includes the following: a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disk (DVD), a memory stick, a floppy disk, a mechanically encoded device such as punch-cards or raised structures in a groove having instructions recorded thereon, and any suitable combination of the foregoing. A computer readable storage medium, as used herein, is not to be construed as being transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission media (e.g., light pulses passing through a fiber-optic cable), or electrical signals transmitted through a wire.

Computer readable program instructions described herein can be downloaded to respective computing/processing devices from a computer readable storage medium or to an external computer or external storage device via a network, for example, the Internet, a local area network, a wide area network and/or a wireless network. The network may comprise copper transmission cables, optical transmission fibers, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. A network adapter card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium within the respective computing/processing device.

Computer readable program instructions for carrying out operations of the present invention may be assembler instructions, instruction-set-architecture (ISA) instructions, machine instructions, machine dependent instructions, microcode, firmware instructions, state-setting data, configuration data for integrated circuitry, or either source code or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++, or the like, and procedural programming languages, such as the “C” programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider). In some embodiments, electronic circuitry including, for example, programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA) may execute the computer readable program instructions by utilizing state information of the computer readable program instructions to personalize the electronic circuitry, in order to perform aspects of the present invention.

Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer readable program instructions.

These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, a programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable storage medium having instructions stored therein comprises an article of manufacture including instructions which implement aspects of the function/act specified in the flowchart and/or block diagram block or blocks.

The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other device to cause a series of operational steps to be performed on the computer, other programmable apparatus or other device to produce a computer implemented process, such that the instructions which execute on the computer, other programmable apparatus, or other device implement the functions/acts specified in the flowchart and/or block diagram block or blocks.

The flowchart and block diagrams in the Figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods, and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the blocks may occur out of the order noted in the Figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems that perform the specified functions or acts or carry out combinations of special purpose hardware and computer instructions.

While the present invention has been described with reference to a limited number of embodiments, variants and the accompanying drawings, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted without departing from the scope of the present invention. In particular, a feature (device-like or method-like) recited in a given embodiment, variant or shown in a drawing may be combined with or replace another feature in another embodiment, variant or drawing, without departing from the scope of the present invention. Various combinations of the features described in respect of any of the above embodiments or variants may accordingly be contemplated, that remain within the scope of the appended claims. In addition, many minor modifications may be made to adapt a particular situation or material to the teachings of the present invention without departing from its scope. Therefore, it is intended that the present invention not be limited to the particular embodiments disclosed, but that the present invention will include all embodiments falling within the scope of the appended claims. In addition, many other variants than explicitly touched upon above can be contemplated.

REFERENCES

-   [1] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient estimation of word representations in vector space. CoRR, abs/1301.3781, 2013.
-   [2] Matt J. Kusner, Yu Sun, Nicholas I. Kolkin, and Kilian Q. Weinberger. From word embeddings to document distances. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6-11 Jul. 2015, pages 957-966, 2015.
-   [3] Yossi Rubner, Carlo Tomasi, and Leonidas J. Guibas. A metric for distributions with applications to image databases. In Computer Vision, 1998. Sixth International Conference on, pages 59-66. IEEE, 1998.
-   [4] Elizaveta Levina and Peter Bickel. The Earth Mover's distance is the Mallows distance: some insights from statistics. In Proceedings ICCV, 2001.
-   [5] Ofir Pele and Michael Werman. Fast Earth Mover's Distance (EMD) Code, 2013. http://www.ariel.ac.il/sites/ofirpele/FastEMD/code/.

What is claimed is:
1. A computer-implemented method for assessing distances between pairs of documents, the method comprising: accessing: two sets X₂ and X₁ of histograms of words, each of the histograms pertaining to a respective document; and a vocabulary V of words, wherein each of the two sets X_(k), k=1, 2, is representable as a sparse matrix, each row of which corresponds to a histogram X_(k)[l] ∈ X_(k), l ∈ {1, . . . , N_(k)}, and wherein each histogram X_(k)[l] is representable as a sparse vector, whose dimension is determined by a dimension of the vocabulary V; and computing distances between pairs of histograms (X₁[i], X₂[j]), ∀ i ∈ {1, . . . , N₁} ∧ ∀ j ∈ {1, . . . , N₂}, by: computing, for each histogram X₂[j] ∈ X₂, and for each word in V, a distance between said each word in V and a closest entry in said each histogram X₂[j], to obtain a dense, floating-point vector y, whose dimension is determined by the dimension of the vocabulary V; and computing, for said each histogram X₂[j] ∈ X₂, a sparse-matrix, dense-vector multiplication between a matrix-representation of the set X₁ of histograms and the vector y, as a matrix-vector product, to obtain distances between all histograms X₁[i], i ∈ {1, . . . , N₁}, of the set X₁ and said each histogram X₂[j], based on which distances between pairs of documents can be assessed.
2. The method according to claim 1, wherein the distance between said each word in V and a closest entry in said each histogram X₂[j], as computed for each histogram X₂[j] ∈ X₂ and for each word in V, is a Euclidean distance.
3. The method according to claim 1, wherein, at computing the distance between said each word in V and the closest entry in said each histogram X₂[j], the vector y is computed based on: a dense matrix W that stores vector representations of words in the vocabulary V, and a dense matrix T₂[j] that stores vector representations of words in X₂[j].
4. The method according to claim 3, wherein, at computing the distance between said each word in V and the closest entry in said each histogram X₂[j], the vector y is computed as min(W ∘ T₂[j]^(T)), where ∘ is a modified matrix multiplication operation, wherein distances are computed between word vectors of W and T₂[j]^(T), and wherein row-wise minimums of W ∘ T₂[j]^(T) are computed to obtain min(W ∘ T₂[j]^(T)), so as to derive minimal distances between the words in the vocabulary V and the words in X₂[j], which minimal distances are stored in the vector y.
5. The method according to claim 4, wherein for each word w ∈ {1, . . . , |V|} in the vocabulary V, a row W[w] of the dense matrix W stores a vector of M floating point numbers obtained via a word embedding process, whereby each word w is mapped to a vector of real numbers.
6. The method according to claim 5, wherein said distances between pairs of histograms are computed so as for a time duration complexity of the method to be O(N₂|V|d₂M + N₁N₂d₁), wherein N₁ and N₂ are respective numbers of histograms of the two sets, d₁ and d₂ are respective, average dimensions of histograms of the two sets, and |V| is a dimension of the vocabulary V.
7. The method according to claim 5, wherein said distances between pairs of histograms are computed so as for a storage space complexity of the method to be O(N₁d₁ + |V|M), wherein N₁ is a number of histograms of the set X₁, d₁ is an average dimension of histograms of this set, and |V| is a dimension of the vocabulary V.
8. The method according to claim 5, wherein said distances between pairs of histograms are computed so as for a time duration complexity of the method to be O(N₂|V|d₂M + N₁N₂d₁), and a storage space complexity of the method to be O(N₁d₁ + |V|M), wherein N₁ and N₂ are respective numbers of histograms of the two sets, d₁ and d₂ are respective, average dimensions of histograms of the two sets, and |V| is a dimension of the vocabulary V.
9. The method according to claim 4, wherein the modified matrix multiplication operation ∘ performed comprises: computing a norm of each row of W and each row of T₂[j]; and multiplying W by the transpose of T₂[j], in order to compute distances between each row of W and each row of T₂[j].

10. The method according to claim 1, wherein the vocabulary V accessed is processed, prior to computing said distance between each word in V and a closest entry in said each histogram X₂[j] for each histogram X₂[j] ∈ X₂ and for each word in V, so as for the processed vocabulary to comprise a set of unique words used in the union of the two sets X₂ and X₁ of histograms.
11. The method according to claim 10, wherein the method further comprises, prior to computing distances between pairs (X₁[i], X₂[j]) of histograms, reducing the vocabulary V by removing words that do not appear in the set X₁ of histograms from the vocabulary V.
12. The method according to claim 1, wherein pairs of steps of computing the distance between said each word in V and a closest entry in said each histogram X₂[j] and computing the dense-vector multiplication between the matrix-representation of the set X₁ and the vector y are computed in parallel, for said each histogram X₂[j] ∈ X₂.
13. The method according to claim 12, wherein said pairs of steps are implemented using linear algebra primitives of general-purpose computing on graphics processing units and execution of such pairs of steps is distributed across a cluster of graphics processing units.
14. The method according to claim 1, wherein the step of computing distances between pairs (X₁[i], X₂[j]), ∀ i ∈ {1, . . . , N₁} ∧ ∀ j ∈ {1, . . . , N₂}, of histograms is a first distance computation step, whereby distances resulting from the first distance computation step are stored in a first data structure representable as a N₁×N₂ matrix D₁, and wherein the method further comprises: a second distance computation step, wherein distances are computed between pairs (X₂[j], X₁[i]) of histograms, ∀ j ∈ {1, . . . , N₂} ∧ ∀ i ∈ {1, . . . , N₁}, by: computing, for each histogram X₁[i] ∈ X₁, and for each word in V, a distance between said each word in V and a closest entry in said each histogram X₁[i], to obtain a dense, floating-point vector y₂, whose dimension is determined by the dimension of the vocabulary V; and computing, for said each histogram X₁[i] ∈ X₁, a sparse-matrix, dense-vector multiplication between a matrix-representation of the set X₂ of histograms and the vector y₂, as a matrix-vector product, to obtain distances between all histograms X₂[j], j ∈ {1, . . . , N₂}, of the set X₂ and said each histogram X₁[i], whereby the obtained distances are stored in a second data structure representable as a N₂×N₁ matrix D₂, and wherein a third data structure is further obtained by computing maximum values of symmetric entries in matrices D₁ and D₂, the third data structure representable as a matrix D, such that D=max(D₁, D₂^(T)), whereby the distances between pairs of documents are assessed according to the third data structure obtained.

15. The method according to claim 14, wherein the vocabulary taken into account for each of the first distance computation step and the second distance computation step is obtained from an initial vocabulary V₀ that comprises a set of unique words used in the union of the two sets X₂ and X₁ of histograms.
16. The method according to claim 15, wherein the vocabulary V₁ taken into account for the first distance computation step is obtained by removing words that do not appear in the set X₁ of histograms from the initial vocabulary V₀, and the vocabulary V₂ taken into account for the second distance computation step is obtained by removing words that do not appear in the set X₂ of histograms from the initial vocabulary V₀.
17. The method according to claim 14, wherein at the first distance computation step, each pair of steps of computing the distance between said each word in V and a closest entry in said each histogram X₂[j] and computing the dense-vector multiplication between the matrix-representation of the set X₁ and the vector y, is computed in parallel, for said each histogram X₂[j] ∈ X₂, and at the second distance computation step, each pair of steps of computing the distance between said each word in V and a closest entry in said each histogram X₁[i] and computing the dense-vector multiplication between the matrix-representation of the set X₂ and the vector y₂, is computed in parallel, for said each histogram X₁[i] ∈ X₁.
18. The method according to claim 14, wherein each of the first distance computation step and the second distance computation step is implemented using linear algebra primitives of general-purpose computing on graphics processing units, and execution of each of these steps is distributed across a cluster of graphics processing units.
19. The method according to claim 1, wherein the method further comprises: storing, for each j, distances resulting from computing distances between pairs of histograms in a first data structure representable as a vector D₁[:, j] whose dimension is equal to N₁, whereby each entry in the vector D₁[:, j] is associated with a histogram of the set X₁; sorting the first data structure D₁[:, j] in ascending order of distances stored thereon, so as for a first entry to correspond to a smallest one of the distances stored in the vector D₁[:, j]; cropping the first data structure D₁[:, j] so as to obtain a first, cropped data structure D^(c)₁[:, j], wherein only K entries are retained, which correspond to the K smallest distances of the first data structure D₁[:, j]; identifying a reduced set X^(c)₁ of histograms associated to D^(c)₁[:, j], based on histograms corresponding to distances stored in D^(c)₁[:, j], wherein the reduced set X^(c)₁ corresponds to a reduced version of the first set X₁; applying a quadratic-complexity, Relaxed Word Mover's Distance method to compute a cost of moving a j^(th) histogram X₂[j] of the second set X₂ into the reduced set X^(c)₁ of histograms, so as to obtain a second data structure D^(c)₂[:, j]; and computing a quantity D^(c)[:, j]=max(D^(c)₁[:, j], D^(c)₂[:, j]), based on which distances between pairs of documents can be assessed.
20. The method according to claim 1, wherein a ratio N₁d₁/|V| is larger than or equal to 1, wherein N₁ is a number of histograms of the set X₁, d₁ is an average dimension of histograms of this set, and |V| is a dimension of the vocabulary V.
21. The method according to claim 1, wherein each of the two sets X₂ and X₁ accessed comprises at least 10 000 histograms.
22. The method according to claim 1, wherein one or more of the histograms of each of the sets X₂ and X₁ accessed comprises at least 10 000 distinct words.
23. A computer-implemented method of managing sets of documents, the method comprising: given a set X₁ of histograms of words, each of the histograms pertaining to a respective document, whereby the set X₁ corresponds to a first set of documents as currently stored in a pool of documents, receiving a new set of documents, the latter being a second set of documents; creating a corresponding set X₂ of histograms of words, each of the histograms pertaining to a respective one of the documents of the new set of documents received; and assessing distances between pairs of documents of the two sets, by: accessing: two sets X₂ and X₁ of histograms of words, each of the histograms pertaining to a respective document; and a vocabulary V of words, wherein each of the two sets X_(k), k=1, 2, is representable as a sparse matrix, each row of which corresponds to a histogram X_(k)[l] ∈ X_(k), l ∈ {1, . . . , N_(k)}, and wherein each histogram X_(k)[l] is representable as a sparse vector, whose dimension is determined by a dimension of the vocabulary V; and computing distances between pairs of histograms (X₁[i], X₂[j]), ∀ i ∈ {1, . . . , N₁} ∧ ∀ j ∈ {1, . . . , N₂}, by: computing, for each histogram X₂[j] ∈ X₂, and for each word in V, a distance between said each word in V and a closest entry in said each histogram X₂[j], to obtain a dense, floating-point vector y, whose dimension is determined by the dimension of the vocabulary V; and computing, for said each histogram X₂[j] ∈ X₂, a sparse-matrix, dense-vector multiplication between a matrix-representation of the set X₁ of histograms and the vector y, as a matrix-vector product, to obtain distances between all histograms X₁[i], i ∈ {1, . . . , N₁}, of the set X₁ and said each histogram X₂[j], based on which distances between pairs of documents can be assessed.
24. The method according to claim 23, further comprising, after having assessed distances between pairs of documents of the two sets: storing the new set of documents in the pool of documents.
25. A computer program product for assessing distances between pairs of documents, the computer program product comprising a computer readable storage medium having program instructions embodied therewith, the program instructions executable by one or more processors, to cause to: access: two sets X₂ and X₁ of histograms of words, each of the histograms pertaining to a respective document; and a vocabulary V of words, wherein each of the two sets X_(k), k=1, 2, is representable as a sparse matrix, each row of which corresponds to a histogram X_(k)[l] ∈ X_(k), l ∈ {1, . . . , N_(k)}, and wherein each histogram X_(k)[l] is representable as a sparse vector, whose dimension is determined by a dimension of the vocabulary V; and compute distances between pairs of histograms (X₁[i], X₂[j]), ∀ i ∈ {1, . . . , N₁} ∧ ∀ j ∈ {1, . . . , N₂}, by: computing, for each histogram X₂[j] ∈ X₂, and for each word in V, a distance between said each word in V and a closest entry in said each histogram X₂[j], to obtain a dense, floating-point vector y, whose dimension is determined by the dimension of the vocabulary V; and computing, for said each histogram X₂[j] ∈ X₂, a sparse-matrix, dense-vector multiplication between a matrix-representation of the set X₁ of histograms and the vector y, as a matrix-vector product, to obtain distances between all histograms X₁[i], i ∈ {1, . . . , N₁}, of the set X₁ and said each histogram X₂[j], based on which distances between pairs of documents can be assessed.